
    Minimizing Communication for Eigenproblems and the Singular Value Decomposition

    Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy, or between processors over a network. Communication often dominates arithmetic and represents a rapidly increasing proportion of the total cost, so we seek algorithms that minimize communication. In \cite{BDHS10} lower bounds were presented on the amount of communication required for essentially all $O(n^3)$-like algorithms for linear algebra, including eigenvalue problems and the SVD. Conventional algorithms, including those currently implemented in (Sca)LAPACK, perform asymptotically more communication than these lower bounds require. In this paper we present parallel and sequential eigenvalue algorithms (for pencils, nonsymmetric matrices, and symmetric matrices) and SVD algorithms that do attain these lower bounds, and analyze their convergence and communication costs.
    Comment: 43 pages, 11 figures
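    For scale (a paraphrase of the \cite{BDHS10} bounds, not this paper's exact statement): on a sequential machine with fast memory of size $M$, any $O(n^3)$-like algorithm must move $W = \Omega(n^3/\sqrt{M})$ words and send $S = \Omega(n^3/M^{3/2})$ messages; on $P$ processors holding $\Theta(n^2/P)$ words each, the per-processor bound is $W = \Omega(n^2/\sqrt{P})$. "Attaining the lower bounds" throughout these abstracts means matching these quantities, in some cases up to polylogarithmic factors.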

    Communication-optimal Parallel and Sequential Cholesky Decomposition

    Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional $O(n^3)$ matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [13, 14, 15], this gives a set of communication-optimal algorithms for $O(n^3)$ implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR, and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy.
    Comment: 29 pages, 2 tables, 6 figures
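    To make the flavor of a communication-aware factorization concrete, here is a minimal sketch (for illustration only; it is not the paper's algorithm and ignores its choice of data structures) of right-looking blocked Cholesky in Python/NumPy. In the two-level model, choosing the block size b so that about three b-by-b blocks fit in fast memory (b on the order of $\sqrt{M/3}$) is what reduces data movement relative to the unblocked loop:

        import numpy as np
        from scipy.linalg import solve_triangular

        def blocked_cholesky(A, b):
            """Right-looking blocked Cholesky sketch: returns lower-triangular
            L with A = L @ L.T, processing b-by-b blocks. The block size b
            stands in for tuning to the fast-memory size M."""
            A = A.astype(float, copy=True)
            n = A.shape[0]
            for k in range(0, n, b):
                kb = min(b, n - k)
                # Factor the diagonal block; it fits in fast memory.
                A[k:k+kb, k:k+kb] = np.linalg.cholesky(A[k:k+kb, k:k+kb])
                L_kk = A[k:k+kb, k:k+kb]
                if k + kb < n:
                    # Panel solve: find L_panel with L_panel @ L_kk.T = A_panel.
                    A[k+kb:, k:k+kb] = solve_triangular(
                        L_kk, A[k+kb:, k:k+kb].T, lower=True).T
                    panel = A[k+kb:, k:k+kb]
                    # Rank-kb update of the trailing submatrix (a tuned code
                    # would update only its lower triangle).
                    A[k+kb:, k+kb:] -= panel @ panel.T
            return np.tril(A)

    As a sanity check, for a random symmetric positive definite A, np.allclose(blocked_cholesky(A, 32), np.linalg.cholesky(A)) should hold.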

    A 3D Parallel Algorithm for QR Decomposition

    Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs.
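    One widely used building block in this line of communication-avoiding QR work is TSQR, which computes the R factor of a tall-skinny matrix by a tree of small QR factorizations; the sketch below (illustrative of the family, not the 3D algorithm of this paper) uses a binary tree, and flattening or deepening that tree is the same kind of bandwidth/latency knob the abstract describes:

        import numpy as np

        def tsqr_r(blocks):
            """Binary-tree TSQR sketch: given a tall-skinny matrix split into
            row blocks (each with at least as many rows as columns), return
            its R factor. Each tree level corresponds to one round of
            messages in a parallel implementation."""
            Rs = [np.linalg.qr(b, mode='r') for b in blocks]
            while len(Rs) > 1:
                nxt = []
                for i in range(0, len(Rs) - 1, 2):
                    # Stack two R factors and re-factor: one message per pair.
                    nxt.append(np.linalg.qr(np.vstack((Rs[i], Rs[i + 1])), mode='r'))
                if len(Rs) % 2:
                    nxt.append(Rs[-1])  # odd block rides up to the next level
                Rs = nxt
            return Rs[0]

    Up to the signs of its rows, the result agrees with np.linalg.qr(np.vstack(blocks), mode='r').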

    Faster all-pairs shortest paths via circuit complexity

    We present a new randomized method for computing the min-plus product (a.k.a. tropical product) of two $n \times n$ matrices, yielding a faster algorithm for solving the all-pairs shortest path problem (APSP) in dense $n$-node directed graphs with arbitrary edge weights. On the real RAM, where additions and comparisons of reals are unit cost (but all other operations have typical logarithmic cost), the algorithm runs in time $\frac{n^3}{2^{\Omega(\log n)^{1/2}}}$ and is correct with high probability. On the word RAM, the algorithm runs in $n^3/2^{\Omega(\log n)^{1/2}} + n^{2+o(1)}\log M$ time for edge weights in $([0,M] \cap {\mathbb Z})\cup\{\infty\}$. Prior algorithms used either $n^3/(\log^c n)$ time for various $c \leq 2$, or $O(M^{\alpha}n^{\beta})$ time for various $\alpha > 0$ and $\beta > 2$. The new algorithm applies a tool from circuit complexity, namely the Razborov-Smolensky polynomials for approximately representing ${\sf AC}^0[p]$ circuits, to efficiently reduce a matrix product over the $(\min,+)$ algebra to a relatively small number of rectangular matrix products over ${\mathbb F}_2$, each of which is computable using a particularly efficient method due to Coppersmith. We also give a deterministic version of the algorithm running in $n^3/2^{\log^{\delta} n}$ time for some $\delta > 0$, which utilizes the Yao-Beigel-Tarui translation of ${\sf AC}^0[m]$ circuits into "nice" depth-two circuits.
    Comment: 24 pages. Updated version now has slightly faster running time. To appear in ACM Symposium on Theory of Computing (STOC), 2014
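    For readers new to the primitive being accelerated, the min-plus product and its use in APSP by repeated squaring fit in a few lines; this is the naive $O(n^3)$-per-product baseline, not the paper's method:

        import numpy as np

        def min_plus(A, B):
            """Naive min-plus (tropical) product: C[i, j] = min_k A[i, k] + B[k, j]."""
            n = A.shape[0]
            C = np.full((n, n), np.inf)
            for k in range(n):
                C = np.minimum(C, A[:, k][:, None] + B[k, :][None, :])
            return C

        def apsp(D):
            """All-pairs shortest paths by repeated min-plus squaring of the
            weight matrix D (np.inf for absent edges, 0 on the diagonal);
            assumes no negative cycles. Uses O(log n) products."""
            n = D.shape[0]
            hops = 1
            while hops < n - 1:
                D = min_plus(D, D)  # after t squarings: paths of <= 2^t edges
                hops *= 2
            return D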